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Abstract. Genome- wide protein-protein interaction (PPI) data are read- 
ily available thanks to recent breakthroughs in biotechnology. However, 
PPI networks of extant organisms are only snapshots of the network evo- 
i i lution. How to infer the whole evolution history becomes a challenging 

LiJ problem in computational biology. In this paper, we present a likelihood- 

0^ based approach to inferring network evolution history from the topology 

of PPI networks and the duplication relationship among the paralogs. 
Simulations show that our approach outperforms the existing ones in 
,-0 terms of the accuracy of reconstruction. Moreover, the growth param- 

eters of several real PPI networks estimated by our method are more 
consistent with the ones predicted in literature. 
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1 Introduction 
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OQ Recent progress in experimental systems biology provides us with an unprece- 

~f\ dented amount of genome- wide protein-protein interaction (PPI) data [5j. In 

(^") order to obtain a deeper insight into the molecular machinery behind these in- 

CN teractions, many network models have been proposed to study or model PPI 

evolution [51 [T71 [20] • However, PPI networks of extant organisms are only snap- 
^ shots of network evolution, and inferring the whole network evolution history 

*^~j remains a challenging problem in computational biology |12j . 

Unlike many networks studied in technology and sociology, the main growth 
mechanism of PPI network is gene duplication and divergence |19j : when a new 
node is added to the network, it copies all the interactions of an existing node 
designed as the anchor node; subsequently some edges adjacent to one of these 
two nodes are randomly lost. This mechanism was explicitly converted to a 
network growth model by Vazquez et al. in [T5]. Since then many extensions 
have been put forth, see for examples, [3HS1 [131 EH- Here we shall focus on a 
particular one called duplication- mutation with complementarity (DMC) , which 
is the best model to fit the D. melanogaster (fruit fly) PPI network according to 
a recent study by Middendorf et al. [TT] . 



When a growth model is fixed, the problem of reconstructing the evolution- 
ary history of an observed network is to infer the relative order of the nodes 



according to which the network evolved (see Section 2.2 for definitions). Better 
understanding of this problem can provide further insights into not only how 
PPI networks are formed, but also how they will possibly evolve in the future. 
Several approaches to address this problem have been proposed in recent years. 
In order to obtain better ways of predicting protein modules, Dutkowski and 
Tiuryn introduced a Bayesian network framework to infer the posterior proba- 
bility of interactions between ancestral nodes based on a duplication and speci- 
ation model [BJ. A similar approach was used by Pinney |15] to infer ancestral 
interactions between bZIP proteins. Gibson and Goldberg proposed a merging 
algorithm to reconstruct the evolutionary history of PPI networks using gene 
trees |S]. A novel framework for estimating the topology of the ancestral net- 
works based on maximal likelihood was presented by Navlakha and Kingsford 
in [l^] • Recently, Patro et al. [2] used a maximal parsimony approach that 
appends edges in observed networks to duplication history forest. 

Here we introduce a new history inferring framework based on the maxi- 
mal likelihood principle. In contrast to the model-based methods in [12 , our 
approach incorporates not only the topology of observed networks, but also 
the duplication history of the proteins contained in the networks. Although the 
evolution of topology is often determined by some growth mechanisms, the du- 
plication history of the proteins can be inferred independently by phylogenetic 
studies [HJ [15] . After establishing some theoretical results concerning the DMC 
model, we reduce the problem of finding most probable history of ancient net- 
works to an optimization problem, and propose some efficient heuristic algo- 
rithms to solve the latter problem. Simulations show that our method provides 
better inference than the ones in [12]. Moreover, we also applied our algorithm 
to the PPI networks of S. cerevisiae (budding yeast), D. melanogaster and C. 
elegans (worm), and the growth parameters obtained by our approach are more 
consistent with the ones predicted in [71I19J. Finally, we also propose an improved 
measure for comparing two histories. 

The rest of the paper is organized as follows: Section[2]provides the framework 
of reconstruction, including the technical background and the inference method. 
In Section[3]we present the inference results for simulations and real data sets. We 
conclude in Section HI with a brief discussion and some possible related research 
directions. 

2 Methods 

2.1 Modeling Network Evolution 

In the DMC model Ai := A4(p c , p), where p c and p are the two parameters that 
specify the model, we start with an initial graph Go, the so-called seed graph. At 
each time step t, the graph Gt is obtained from Gt-i by the following procedures 
(see Fig.[T]for an illustration): (1) (Duplication) A node ut is chosen uniformly at 
random from the set of nodes in Gt-i, and a new node v t is added and connected 



to every neighbor of Ut ■ Here Ut and Vt are often referred to as the anchor node 
and duplicate node at step t, respectively. (2) (Mutation) For each neighbor of 
u t , say w, we choose one edge from (u t ,w) and (v t ,w) with equal probability, 
and this chosen edge is deleted with probability 1 — p. (3) (Complementarity) 
The nodes Ut and Vt are connected with probability p c . 
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(A) (B) (C) (D) 

Fig. 1: Illustration of the DMC model. (B) is obtained from (A) by one duplica- 
tion step, with node 1 (represented in maroon) as the anchor node and node 4 
as the duplicate node (represented in purple) ; the probability that node 1 is cho- 
sen as the anchor node is 1/3 because the network in (A) contains three nodes. 
(C) is obtained from (B) by the mutation step, which occurs with probability 
p(l —p)/2. (D) is obtained from (C) by the complementarity step, which occurs 
with probability p c . 



Note that the DMC model is Markovian, that is, the probability of obtaining 
Gt when Gt-i is given depends solely on the parameters of M.. For example, 
denoting the network (A) and (D) in Fig [l] by G t -\ and G t , respectively, then 
the probability P(Gt\Gt~i, M.) that Gt is evolved from Gt-i by one step under 
the model M. is p(l — p)p c /2. 

2.2 History Reconstruction 

Given an observed network G, a growth history T-L of G is a graph sequence 
(Gq,Gi,- ■ ■ ,G n ) such that G n = G and for each index t in {1, • • • ,n}, the 
graph G t can be obtained from G t ~\ in one step under the DMC model M.. 
The first graph Go is referred to as the seed graph of the history. In addition, the 
number n is called the span of the history. Clearly, a history % induces a unique 
sequence 9 := 9(H) of duplicate nodes, that is, 9 = (vi, ■ ■ ■ , v n ) such that for all 
t, node Vt is the unique node in Gt, but not Gt-i- 

Given a network G, let % be the growth history we hope to infer. The prob- 
ability of G being evolved according to history Ji, when viewed as a function 
of the unknown history "H, is called the likelihood function L{'H \ G,M) that is 
given by 



L(H\G,M) = Y[v(G t \Gt-i,M). 



We adopt a maximal likelihood approach to infer the history of G as below. 

Problem 1. Given a network G together with a natural number n and model M., 
construct a growth history T-L that maximizes the likelihood L(W | G, M.) among 
all histories with span n. 

This problem is expected to be difficult since the number of possible histories 
grows exponentially, and we are not aware of any results concerning whether 
this problem is polynomial-time solvable. Before introducing a variant of the 
above problem that is more tractable, we present some necessary tools in the 
following two subsections. 

2.3 Duplication Forest 

We begin with duplication history, which is closely related to network history as 
gene duplication is a major driving force of PPI network evolution [15]. The idea 
of encoding the duplication history by a forest of binary tree was used in O E] . 
Patro ct al. [14 incorporated duplication history in a parsimony approach to 
reconstruct network history. 

A growth history "H of a PPI network induces a unique duplication forest. 
Initially, we have a forest _To consisting of isolated nodes that are identical to 
the set of nodes in the seed graph. At each step t, the forest r t is obtained from 
It—i by replacing the anchor node u t with a cherry {ut,vt} consisting of u t and 
the duplicate node Vt ■ Here a cherry {u, v} is referred to a subtree consisting of 
two leaves u and v and the internal node adjacent to them. 

The duplication forest of a PPI network can also be inferred independently 
without using its growth history. For instance, such a forest can be reconstructed 
by the phylogenetic relationships between the genes in the network [15] ■ This 
observation is key to our investigation. 

2.4 Backward Operator 

In this subsection, we will introduce a backward operator that is important in 
our inference framework. 

Consider one step in a growth history, that is, a graph Gt obtained from Gt-i 
in one step by using anchor node u t and duplicate node v t . Now we want to define 
a backward operator 1Z such that Gt-i can be determined by this operator and 
the triplet (G t ,u t ,v t ). To this end, let TZ^(G t ) be the graph obtained from G t 
by merging the two nodes u t and v t in G t , that is, (i) for each neighbor w of v t 
such that w ^ u t and w is not adjacent to Ut, add an edge (w,ut); (ii) delete 
the node v t and all edges incident to it. 

Similarly, the backward operator can be applied to the duplication forest, 
that is, 1Z%f(r t ) is the forest obtained from r t by replacing the cherry {u t ,v t } 
with the leaf u t . Note that this definition is consistent with the above one in 
the following sense: If r t is the duplication forest corresponding to the network 
Gt, then TZ^(r t ) is the duplication forest associated with 7\L™*(G t ). When the 
anchor node ut is clear from the context, we also write H Vt for 72-"'. 



2.5 Growth History with Known Duplication Forest 

Using the backward operator introduced above, we shall introduce a scheme to 
represent a growth history with known duplication forest by a node sequence. 
Throughout this paper, we use the convention that a node sequence consists of 
distinct nodes, while a node list may contain repeated nodes. 

In general, a node sequence 9 = (vi, ■ ■ ■ ,v n ) and a duplication forest r are 
said to be compatible if there exists a (necessarily unique) sequence (Tf , • • • ,T%) 
of forests such that r® = r, and r^_ 1 = lZ Vt (r t ) holds for each t e {1, • • • , n}. 
Note that a necessary and sufficient condition for 9 and r being compatible is 
that Vt belongs to a cherry in rf for each t. Denoting the sibling of Vt in Tf, 
that is, the unique leaf in r t that forms a cherry with Vt, by ut, we say the list 
7r = (ui, • • • , u n ) is the anchor list determined by r and 8. 

As mentioned above, a growth history % = (Go, • • • , G„) specifies a duplicate 
sequence 9 = (v\, ■ ■ ■ ,v n ) and a duplication forest r. Clearly, the sequence 9 and 
forest r must be compatible. On the other hand, given a duplication forest J 1 
associated with a network G and a sequence 9 that is compatible with _T, then 
there exists a unique growth history H such that 9 is induced from T-L. In other 
words, when the duplication forest r is fixed, a growth history % is uniquely 
determined by the duplicate sequence 9 associated with it. In this context, the 
likelihood function is defined as 

n 

L(6\G,r,M):='[[¥(G e i \Gl 1 ,r,M), 

?=i 

where W(Gf\Gf_ 1 ,r,M) is the probability that G\ is evolved from G\_ x in one 
step under the DMC model M. and using the anchor node u t specified by 9 
and r. Note that in general the probability P(Gf \Gf_ 1 ,r, M.) is different from 
F(Gf\Gf_ 1 ,A4). Indeed, the latter can be regarded as an "average" of the former 
over all possible anchor nodes. 

Now, the problem of inferring growth history with given duplication forest, 
a variant of Problem 1 that will be studied in this paper, can be formally stated 
as below. 

Problem 2. Given a network G together with a duplication forest _T and a 
growth model Ai, construct a duplicate sequence 8 such that the likelihood 
L(9 | G, r, M) is maximized. 

In the above problem, the parameters in the DMC model M. are specifically 
mentioned. However, as we shall see later, the parameters of M. are not needed 
for the history inference problem. 



2.6 Theoretical Results 

Here we present some theoretical results that are crucial to solve Problem 2. Due 
to space limitations, all proofs are outlined in the Appendix. 



Lemma 1. Given a network G with duplication forest r, for any two sequences 
81 and 9 2 that are compatible with F, the graph Gq 1 is isomorphic to G 2 . 

Given a duplicate sequence 9 = (vi,V2,--- >v n ), we shall associate it with 
three families of numbers that are crucial to our analysis. For each duplicate node 
Vi in 8, let S(vi) be the indicator function that takes value 1 if vi is connected to 
its anchor node Ui, and otherwise; a(vi) the number of the neighbors shared 
by Vi and Uf, and f3(vi) := j3(vi,Gi) the number of nodes adjacent to Vi or m in 
G\ , but not both. Note that 28{vi) + 2a(vi) + (3(vi) is equal to the sum of the 
degree of Vi and that of m in G\. 

The sum 8(9) := 2j=i ${ v i) is called the complementarity number of history 
8, the sum a(9) := X^ILi a i v i) is called the extension number of 9, and j3(9) := 
Y^i=i P( v i) is called the loss number of 9. 

We complete this subsection with the following two key results. The first one 
shows that the complementarity number and extension number are constants 
over all compatible duplicate sequences. 

Theorem 1. Given a network G with duplication forest r and two compatible 
duplicate sequences 9\ and 9 2 , we have 8(9\) = 8(9 2 ) and a(9\) — a(9 2 ). 

Theorem 2. Given a network G with duplication history r, the ratio of two 
likelihood functions for two duplicate sequences 9\ and 9 2 that are compatible 
with r is given by 

L(di\G,M,r) _ /l-j)\««!)-W 
L(9 2 \G,M,r) ~ V 2 ) 

2.7 Reconstruction Algorithms 

By Theorem [21 solving Problem 2 is equivalent to solving the following problem. 

Problem 3. Given a network G and its duplication forest J", construct a duplicate 
sequence 9 such that the loss number (3(9) is minimized among all sequences 
compatible with Z\ 

In this section, we propose some heuristic algorithms to solve Problem 3, 
and hence Problem 2. The first one is a greedy algorithm called minimal loss 
number (MLN), in which we choose a duplicate node with the smallest value 
f3(v) among all candidate ones. 

To motivate our main reconstruction algorithm, we introduce some further 
notation and results. A duplicate sequence 9\ = (v\,--- ,v n ) is said to be 
swapped from 9 2 = (v[, ■ ■ ■ , v' n ) at position m for some index m e {1, • • • , n— 1} 
if we have v' m = w m +i, v' m+l = v m , and v\ = Vi for all other indices i. 

Lemma 2. Given a network G with duplication forest r , if 9\ and 9 2 are two 
compatible duplicate sequences such that 9\ is swapped from 9 2 at position m, 
then we have G^ = G t 2 for each index i € {0, • • • ,n} with i ^ m. 



Let 9\ and #2 be two compatible duplicate sequences as stated in the above 
lemma. By LemmaJS] and Theorem [2j L{B 1 \ G,T,M) > L{9 2 \ G,T,M) if and 
only if for G m = G„j = G^ , we have 

/3(v m ,G m ) + l3(v m ^,TZ Vm (G m )) < p(v m ^,G m ) + ^v^n^^iGm))- (1) 

Motivated by the above observation, for two cherries {u, v} and {u',v'} in _T 4 , 
we say {u, v} is more favorable than {u',i/}, denoted by {u,v} >- {u',v'}, if 
P(y, G t ) + P{v', K%(G t )) < /3(v', G t ) + /3(v, n^',(G t )) holds. Note that in general 
the relation >- is not transitive, that is, {u,v} y {u' ,v'} and {u 1 ,v'} >- {u* ,v*} 
does not imply {u, v} >- {w*,t>*}. 

Now we present our main inference algorithm called cherry greedy (CG), 
which runs as follows: At every backward reconstruction step, we choose a node 
from the most favorable cherry C, that is, the number of cherries C" with C >- C 
is maximized. If several cherries are equally favorable, we uniformly choose one 
of them. More precisely, starting from Gt := G and r t := -T, we choose a most 
favorable cherry (u, v) from Ft and uniformly choose one node from the cherry, 
say v t , as the duplicate node at this step. Then we construct G t ~i as lZ Vt (G t ) 
and F t _i = lZ Vt (r t ). This process continues until Go is obtained. 

Since the above algorithm is a stochastic one, that is, among a chosen cherry 
{u, v}, u and v has the equal probability of being chosen as the duplicate node. 
Therefore, one natural way of improving its accuracy is to repeat the algorithm 
for a certain times and report the best output, where the number of repetitions 
can be regarded as a tuning parameter. When the real duplicate sequence # rea i is 
known, the best one is defined as the output 9 such that Kendall's r between roa i 
and 9 is maximized (see Section 3 for further details on Kendall's r), otherwise 
the one with the smallest loss number is chosen. This strengthened version of 
the CG algorithm with be refereed to as CGR, where 'R' stands for repetition. 



2.8 Estimating Parameters 



From the results in Section [2^6] and Section [277| it is clear that the parameters 
of the DMC model are not used in our inference framework. Moreover, here we 
will present a method by which the parameters can be established after a growth 
history being inferred. 

To this end, assume that a growth history 7-L — (Go, • • ■ , G„), together with 
the duplicate sequence (i>i, • • • , v n ) and anchor list (ui, ■ ■ ■ ,u n ), is given. Note 
that for each neighbor w of node Ui in G;_i, the probability that w is adjacent 
to both Ui and Vi in Gi is p. In other words, the extension number a{vi) at z-th 
step, i.e., the number of the common neighbors shared by Ui and Vi in Gi, has the 
binomial distribution with parameters p and /3(ui) + a(vi), where 0(ui) + a(i>i) 
is the number of neighbors that Ui has in Gi-\. On the other hand, the random 
variable 8(vi) has Bernoulli distribution with parameter p c . Therefore, we are led 
to propose the estimators p = Rig\ +a (g\ an d Pc = -^ to estimate the parameters 
p and p c respectively. 



3 Results 

Our reconstructing algorithms, minimal loss number (MLN) and cherry greedy 
(CG), have been implemented in Perl, which is available upon request. Given 
a network G and duplication forest _T, each outputs a hypothetical duplicate 
sequence 9 that approximates the one with the minimal loss number. 

To assess the performance, we need to measure the difference between the 
inferred duplicate sequence and the 'real' one. One popular index for this purpose 
is Kendall's tau K T [TJ[T2]. Formally, for two sequences 6\ — {v\, ■ ■ ■ ,v n } and 
&2 = {v'i, • • • , v' n } that consist of the same set of nodes, K T (di,$2) is defined as 

n(n — 1) 

where n c is the number of concordant pairs, that is, the number of pairs in 9\ 
that are in the correct relative order with respect to #2jand n^ is the number 
of discordant pairs. Note that we have i4T T (#1,6*2) = 1 if the two sequences are 
identical, and K T (9±,92) = —1 if they are exactly opposite. 

3.1 Simulation Validation 

To validate our algorithms, we generated 100 random network using each DMC 
model A4, where the parameters p c and p ranged from 0.1 to 0.9 at 0.2 intervals. 
Each network has 100 nodes and is evolved from the same seed graph K 2 (i.e., 
the graph with two nodes and one edge). 

For each simulated network G, its duplication forest r and duplicate sequence 
#reai were recorded. Next, we reconstructed duplicate sequences using our algo- 
rithms. The one using MLN is denoted by #mln, and the one using CG by 9cg- 
We also considered the algorithm CGR, which outputs 9cgr, the one with the 
highest Kendall's r among ten runs of CG. We ran some of the experiments 
more than 10 times but found that more runs did not improve the results much, 
and hence we ran 10 times throughout. For comparison, we also generated a 
random duplicate sequence 9 rll nd, which can be interpreted as a 'null model'. 
Finally, we computed K T (9 1 - caU 9) for 9 G {6> ran d, #mln, 9 G g,&cgr}- 

The results for K T {9 Yea \ 1 6* ran d) and K T {9j. aa x, ^mln) are summarized in Fig.p] 
Our results for K T {9^ ca \, 6* ran d) agree well with the theoretical mean of ^(^rcai, #rand), 
which is 0. In addition, the results for K T {9 roa \,9cG) and -ftTi-(0reai,#CGR) are 
summarized in Fig. [3j From these results, we can see that compared to ran- 
dom duplicate sequences, our algorithms have improved the values of Kendall's 
t substantially. In addition, in general CG has better performance than MLN. 
Finally, repeating algorithm CG a few times will increase the performance. 

3.2 Comparison with Existing Methods 

In this subsection, we compare the performance of our algorithm CG with 
Net Arch, the inference method introduced in [12]. Since duplication forest is 
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Fig. 2: Results for simulation data sets. The figure in the left is the heat 
map representing the values of -Kr (0reai,$rand), and the one in the right is for 
K T (9 Ica i, 0mln)- Here the value of Kendall's r is represented by the intensity of 
color. 



not incorporated in the framework proposed in |12j . it would be expected that 
CG will outperform NetArch. 

Indeed, Fig. [3] already shows that our algorithm CGR outperforms NetArch 
because in [T2] , the authors claimed that the values of Kendall's r between the 
real duplicate sequence and the one constructed by their method are between 
0.2 and for the same set of combinations of parameters. 

Even without using repetition, CG also outperforms NetArch in general. We 
demonstrate this by comparing the performance of them over 100 simulated 
random networks. For each simulation, we generated a pair of parameters p and 
p c uniformly from the interval (0, 1), and one graph G with 30 nodes from the 
seed graph K^ using the DMC model A4. As above, the duplication forest r and 
duplicate sequence f5 rea i were recorded. Next, both NetArch and CG were used to 
reconstruct the duplicate sequence, and their outputs were denoted by f3 Not and 
f?CG history. Finally, the values t\ := K T {6 vea \,9cG) and r 2 := K T {d Tea \, 0Net) 
were computed. 

Among the 100 simulated networks, CG outperforms NetArch 87 times, and 
the distributions of Ti — t\ and t\ — r 2 are summarized in Fig. |4a| Note that for 
the cases when CG outperforms NetArch, the gains in terms of Kendall's tau is 
significant, i.e., the average value is 0.2. 

Moreover, we also compared the parameters p and p c estimated by using CG 
with the ones p best and p b ^ st obtained by the method in [TJ] . Fig • 



41) 



are the box 

plots for the errors of these four estimations, in which the data are calculated as 
\p — p\, etc. Note that the closer to 0, the better the estimation is. We can see 
that our method has smaller means of errors and smaller length of confidence 
intervals for both p and p c . 
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Fig. 3: Results for the algorithm CG and CGR. The figure in the left is the heat 
map for K T (9 Iea x, Ocg) an d the one in the right for K T (9 roa \, Ocgr)- I n CGR we 
run CG for 10 times and report the output with the highest Kendall's r. 



3.3 Application to Real PPI Networks 

We downloaded 460 gene trees reconciled in [BJ. The gene trees contain genes 
from S. cerevisiae (budding yeast), D. melanogaster (fruit fly) and C. elegans 
(worm). For each gene tree, we used the genes of one species and deleted all 
the genes from the other two species to create a gene duplication forest for 
each species. In addition, we downloaded corresponding PPI networks from the 
database DIP ( http://dip.doe-mbi.ucla.edu/dip/Main.cgi). Since the gene trees 
obtained in this way are timed, we can infer from them a duplicate sequence 
6** cal that approximates the real duplicate sequence. 

When we checked the gene trees, we found that some of them, especially the 
large ones, are very asymmetric about the root, which are not common for the 
duplication trees associated to networks generated by the DMC model. To handle 
this asymmetry, we modified our inference algorithm CG by taking account the 
depth of leaves (i.e., the number of edges between the leave and the root). More 
precisely, in each backward step we choose the most favorable cherry among the 
cherries whose depth is larger than a threshold. The output of this modified CG 
algorithm will be denoted by CG . 

The values of r = K T (6* eai , GG ) f° r * ne three networks are listed in Tablefll 
In addition, the corresponding estimated parameters p and p c are also listed. 
Note that these estimations are consistent with those in [3 [15], where the authors 
asserted that p and p c are smaller than 0.1. Since the one obtained in [T2] is 0.7, 
here we also demonstrate the advantage of incorporating duplication history in 
growth history reconstruction. 



3.4 An improved measure 

Typically one cannot distinguish between a duplicate node from its anchor node. 
Therefore, while Kendall's tau between two sequences is natural for comparing 
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Fig. 4: (a) Box plot for differences between two methods. t\ is the Kendall r 
obtained by CG and r 2 is obtained by NetArch. For n — r 2 , we only consider 
the cases t\ > r 2 , and likewise for t 2 — n . (b) Box plot for errors of estimations 
of parameters. Here parameters are uniformly generated from the interval (0, 1). 



Table 1: The Kendall's r and estimated parameters for three PPI networks. 





S. cerevisiae 


C. elegans 


D. melanogaster 


p 


0.061142 


0.020976 


0.025953 


Vc 


0.053215 


0.048443 


0.024182 


r 


0.378 


0.316 


0.473 



duplicate sequence, it also inherits the intricate difficulty of separating anchor 
nodes from duplicate nodes. To overcome this problem, we propose an alternative 
measure to compare two duplicate sequences, by which the 'symmetry' between 
anchor nodes and duplicate nodes is taken into account. 

To begin with, each internal node of the duplication forest r is labeled by 
a unique label. Note that each duplicate sequence 9 that is compatible with 
r induces a unique sequence ~f(9) by replacing duplicate node Vi with the 
label of the parent of Vi in if. For two duplicate sequences 9\ and # 2 , let 
K*(8i,02) := -ftT T (7((9i),7(# 2 )), and we argue this is a more appropriate mea- 
sure since here we do not make a distinction between anchor nodes and dupli- 
cate nodes. Using the simulated networks obtained in Section [XT] we present in 



Fig. [5] the results for K*(0 Ieah 9 comp ) and K*(0 Ieal ,d C G), where 6> comp is a du- 
plicate sequence uniformly chosen from all compatible sequences. These results 
also validate our algorithm CG as K*(6 Iea x, #cg) is higher than K*(9 rea \, 9 comp ). 
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Fig. 5: Results measured by K*. The figure in the left is for K*(9 rca \, 9 comp ) and 
the one in the right for K*(8 rca \, #cg)- Here the simulated networks are the same 
as the ones used in obtaining Fig. [2| 



4 Discussion 

Assuming the observed network is the result of a growing mechanism as depicted 
in the DMC model, we have presented a likelihood-based algorithm for recov- 
ering the most probable network evolutionary history by exploiting the known 
duplication history trees of paralogs in the observed network. Through a series of 
reduction of the search space of all histories to (i) compatible duplicate sequences 
and (ii) the set of favored duplicate nodes, we have provided a computationally 
efficient algorithm. Our approach successfully re-traces the network evolution 
especially in the scenario that the labels of ancestor nodes are not necessarily 
to be one of the duplicates. As a useful by-product of our reconstruction, we 
propose natural estimators for the model parameters which are of independent 
interest. Our approach can be applied to infer the order of duplication events and 
to trace the topological characteristics of networks as they evolve. Our method, 
though described in the context of the DMC model, can be adapted to other 
network growing models. In addition, it can potentially be extended to predict 
the emergence of interactions and modules during the network evolution, and 
hence to provide comparison of the evolution history across different species. 
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Appendix 

Proof of LemmaY]\ Assume that r consists of k binary trees T, , ■ ■ ■ , T&, and 9 is 
a duplicate sequence compatible with r. For each graph G in the graph sequence 
{Gq, • • • , G„}, we can associate it with a graph 11(G) as follows. The vertex set 
of 11(G) is {1, ■ • • , k} and two distinct vertices i and j are adjacent if and only if 
there exist some adjacent nodes gi and gj in G such that gi is a leaf in the tree 
Ti and gj is a leaf in Tj . 

Let G be a graph in {G,,-- ■ , G^}. Denote the anchor node and duplicate 
node corresponding to this graph by u and v, respectively. Since 9 is compatible, 
u and v are the leaves in the same tree in r. Note that for any vertex g that is 
distinct from u and v, then g is adjacent to u or v in G if and only if g is adjacent 
to u in 7\L"(G). Therefore, we can conclude that 11(G) = H(1Z"(G)), and hence 
also II(Gq) = II(G^). On the other hand, from the construction we know that 
II(Go) is isomorphic to Gq. 

In consequence, for two compatible duplicate sequences 9, and 9 2 , since G^ 1 = 
G n = G^ 2 , we can conclude that Gq 1 and G 2 are isomorphic, as required. □ 

Proof of Theorem [II We shall establish the lemma by induction on the number 
of cherries in r. The base case that r contains no cherry is trivial, because this 
implies n = 0. 

Now assume that r contains m cherries, and that the lemma holds when the 
number of cherries in the duplication forest is at most m — 1. Fix a cherry {u, v} 
in r and choose a label g that is not used before. Consider the network G* that 
is obtained from 7\L"(G) by relabeling u with g, and the duplication forest r* 
obtained from r by replacing the cherry {u, v} with a leaf labeled as g. Note 
that either node uoru (possible both) must appear in the duplicate sequence 
of 9\ ; we replace them with g and denote the sequence with the first g removed 
by 9\. Then 9\ is a duplicate sequence that is compatible with P* . 

Similarly, the sequence 9^ obtained from #2 in the same way is also compatible 
with P* . Now the induction assumption implies 6(9*) — 5(^|)- Together with 

6(9,) - S(9t) = 6(9 2 ) - S(9* 2 ), 

we have 6(9\) = 6(9 2 ), as required. 

On the other hand, the number of edges increased from G\_ x to G\ is given 
by 6(vi) and a(vi), where Vi is the duplicate node. Together with Lemmalll this 
implies 

6(9 X ) + a(9 2 ) = \E(G n )\ - \E(G e J)\ = \E(G n )\ - \E(G^)\ = 6(9 2 ) + a(9 2 ). 

Since 8(9,) = 8(9 2 ), wc have a(9,) = a(9 2 ). □ 

Proof of Theorem^ Let 9 = {v,, ■ ■ ■ , v n } be a duplicate sequence that is com- 
patible with the duplication forest r. By Lemma[l]and Theorem[l] it is sufficient 
to note that 

L(0|G,M,r)=^V ( V W , 



holds with q := (1 — p)/2, an observation following from that 

F{G $ i \G 6 i _ 1 ,r,M)=p 5 c (vi) p a{vi) q (3{vi) 
holds for each i £ {1, • • • , n}. □ 

Proof of Lemma Lgt Clearly, we have G^ = G^ for i > m. To show this also 
holds for i < m, it suffices to show G r ^_ 1 = G 6 I ^_ 1 For i £ {m,m + 1}, let itj 
be the anchor node of v*. Since 0\ and #2 are both compatible with _T, we know 
that {u m ,Vm\ and {u m +i, i> m +i} are two distinct cherries in r^ +1 = -T m 2 +1 . 
Therefore, we have 

K: (K:i; (gwo) = ft££ (rc^ (G m+1 )), 

because the four nodes u m , w m , u m +i and u m +i are distinct. □ 



